Building a scalable global data processing pipeline for large astronomical photometric datasets
نویسنده
چکیده
Astronomical photometry is the science of measuring the flux of a celestial object. Since its introduction in the 1970s the CCD has been the principle method of measuring flux to calculate the apparent magnitude of an object. Each CCD image taken must go through a process of cleaning and calibration prior to its use. As the number of research telescopes increases the overall computing resources required for image processing also increases. As data archives increase in size to Petabytes, the data processing challenge requires image processing approaches to evolve to continue to exceed the growing data capture rate. Existing processing techniques are primarily sequential in nature, requiring increasingly powerful servers, faster disks and faster networks to process data. Existing High Performance Computing solutions involving high capacity data centres are both complex in design and expensive to maintain, while providing resources primarily to high profile science projects. This research describes three distributed pipeline architectures, a virtualised cloud based IRAF, the Astronomical Compute Node (ACN), a private cloud based pipeline, and NIMBUS, a globally distributed system. The ACN pipeline processed data at a rate of 4 Terabytes per day demonstrating data compression and upload to a central cloud storage service at a rate faster than data generation. The primary contribution of this research however is NIMBUS, which is rapidly scalable, resilient to failure and capable of processing CCD image data at a rate of hundreds of Terabytes per day. This pipeline is implemented using a decentralised web queue to control the compression of data, uploading of data to distributed web servers, and creating web messages to identify the location of the data. Using distributed web queue messages, images are downloaded by computing resources distributed around the globe. Rigorous experimental evidence is presented verifying the horizontal scalability of the system which has demonstrated a processing rate of 192 Terabytes per day with clear indications that higher processing rates are possible.
منابع مشابه
DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge
The Data Activated Liu Graph Engine — DALiuGE — is an execution framework for processing large astronomical datasets at a scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components and an implementation run-time to execute such pipelines on distributed resources. By...
متن کاملIn-flight calibration of the Herschel-SPIRE instrument*
SPIRE, the Spectral and Photometric Imaging REceiver, is the Herschel Space Observatory’s submillimetre camera and spectrometer. It contains a three-band imaging photometer operating at 250, 350 and 500 μm, and an imaging Fourier-transform spectrometer (FTS) covering 194−671 μm (447−1550 GHz). In this paper we describe the initial approach taken to the absolute calibration of the SPIRE instrume...
متن کاملAlgorithms for Large-Scale Astronomical Problems
Modern astronomical datasets are getting larger and larger, which already include billions of celestial objects and take up terabytes of disk space. Meanwhile, many astronomical applications do not scale well to such large amount of data, which raises the following question: How can we use modern computer science techniques to help astronomers better analyze large datasets? To answer this quest...
متن کاملData-intensive science: The Terapixel and MODISAzure projects
We live in an era in which scientific discovery is increasingly driven by data exploration of massive datasets. Scientists today are envisioning diverse data analyses and computations that scale from the desktop to supercomputers, yet often have difficulty designing and constructing software architectures to accommodate the heterogeneous and often inconsistent data at scale. Moreover, scientifi...
متن کاملInteractive Exploration on Large Genomic Datasets.
The prevalence of large genomics datasets has made the the need to explore this data more important. Large sequencing projects like the 1000 Genomes Project [1], which reconstructed the genomes of 2,504 individuals sampled from 26 populations, have produced over 200TB of publically available data. Meanwhile, existing genomic visualization tools have been unable to scale with the growing amount ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1502.02821 شماره
صفحات -
تاریخ انتشار 2015